A Similarity - based Probability Model for Latent Semantic IndexingChris

نویسنده

  • Chris H.Q. Ding
چکیده

A dual probability model is constructed for the Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of of LSI. Dimensionality reduction is justiied by the statistical signiicance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions , answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clariied.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...

متن کامل

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...

متن کامل

Question Retrieval Based on Probabilistic Latent Semantic Analysis in Q & A Community

With the increasingly popularity of Q & A Community, it has become an important means for people to retrieve question from question library to find the answer. Similarity calculation is the core issue in Q & A community, and the appropriate calculation method is the key factor that affects the quality of question retrieval. This paper proposes a retrieval method based on PLSA model. Firstly, we...

متن کامل

Probabilistic Models of Cross-Lingual Semantic Similarity in Context Based on Latent Cross-Lingual Concepts Induced from Comparable Data

We propose the first probabilistic approach to modeling cross-lingual semantic similarity (CLSS) in context which requires only comparable data. The approach relies on an idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair independent latent semantic concepts (e.g., crosslingual topics obtained by a multilingual topic model). These latent cros...

متن کامل

Similarity Measures Based on Latent Dirichlet Allocation

We present in this paper the results of our investigation on semantic similarity measures at wordand sentence-level based on two fully-automated approaches to deriving meaning from large corpora: Latent Dirichlet Allocation, a probabilistic approach, and Latent Semantic Analysis, an algebraic approach. The focus is on similarity measures based on Latent Dirichlet Allocation, due to its novelty ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999